
feat: add dynamic shapes kernel specialization strategy for TRT-RTX #4184

Merged
lanluo-nvidia merged 2 commits into pytorch:main from tp5uiuc:feat/trtrtx-dynamic-shapes-strategy on Apr 21, 2026

Conversation

tp5uiuc (Contributor) commented Apr 12, 2026

Description

Expose IRuntimeConfig.setDynamicShapesKernelSpecializationStrategy() through the Torch-TensorRT Python API for TensorRT-RTX builds.

Users can now control how shape-specialized kernels are compiled at runtime for dynamic shapes via the new dynamic_shapes_kernel_specialization_strategy compilation setting:

  • "lazy" (default): Compile shape-specialized kernels in the background, use fallback until ready
  • "eager": Compile immediately (blocking)
  • "none": Always use fallback kernels, never specialize

Depends on: #4180 (runtime cache API — provides the IRuntimeConfig infrastructure)
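
A minimal usage sketch, assuming a TensorRT-RTX build of Torch-TensorRT in which this setting is available (the model, shapes, and strategy value below are purely illustrative):

```python
import torch
import torch_tensorrt as torchtrt

model = torch.nn.Sequential(torch.nn.Conv2d(3, 16, 3), torch.nn.ReLU()).eval().cuda()

# Dynamic batch dimension: the engine must handle any shape between min and max.
inputs = [
    torchtrt.Input(
        min_shape=(1, 3, 224, 224),
        opt_shape=(8, 3, 224, 224),
        max_shape=(16, 3, 224, 224),
        dtype=torch.float32,
    )
]

# "lazy" (default) compiles specialized kernels in the background,
# "eager" compiles them up front (blocking), "none" always uses fallback kernels.
trt_mod = torchtrt.compile(
    model,
    ir="dynamo",
    inputs=inputs,
    dynamic_shapes_kernel_specialization_strategy="lazy",
)

out = trt_mod(torch.randn(4, 3, 224, 224, device="cuda"))
```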

Type of change

  • New feature (non-breaking change which adds functionality)

Checklist:

  • My code follows the style guidelines of this project (You can use the linters)
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas and hacks
  • I have made corresponding changes to the documentation
  • I have added tests to verify my fix or my feature
  • New and existing unit tests pass locally with my changes
  • I have added the relevant labels to my PR so that relevant reviewers are notified

meta-cla bot added the cla signed label Apr 12, 2026
github-actions bot added labels Apr 12, 2026: documentation, component: tests, component: conversion, component: core, component: build system, component: api [Python], component: runtime, component: dynamo
github-actions bot requested a review from cehongwang April 12, 2026 20:48
Comment thread: tests/py/dynamo/runtime/test_001_dynamic_shapes_kernel_strategy.py
github-actions bot requested a review from zewenli98 April 14, 2026 17:44
tp5uiuc force-pushed the feat/trtrtx-dynamic-shapes-strategy branch from c222c72 to 385eec6 on April 15, 2026 18:54
tp5uiuc and others added 2 commits April 20, 2026 08:58
Expose IRuntimeConfig.setDynamicShapesKernelSpecializationStrategy()
through the Torch-TensorRT Python API. Users can now control how
shape-specialized kernels are compiled at runtime for dynamic shapes
on TensorRT-RTX via the new `dynamic_shapes_kernel_specialization_strategy`
compilation setting ("lazy", "eager", or "none").

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Address review feedback: compile with torchtrt.Input min/opt/max
ranges so dynamic shapes are actually exercised.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
tp5uiuc force-pushed the feat/trtrtx-dynamic-shapes-strategy branch from 385eec6 to d7619ca on April 20, 2026 15:58
tp5uiuc marked this pull request as ready for review April 20, 2026 16:09
lanluo-nvidia (Collaborator) left a comment

lgtm, one minor comment.

hardware_compatible (bool): Build the TensorRT engines compatible with GPU architectures other than that of the GPU on which the engine was built (currently works for NVIDIA Ampere and newer)
timing_cache_path (str): Path to the timing cache if it exists (or) where it will be saved after compilation. Not used for TensorRT-RTX.
runtime_cache_path (str): Path to the runtime cache for TensorRT-RTX JIT compilation results. Not used for standard TensorRT.
dynamic_shapes_kernel_specialization_strategy (str): Strategy for dynamic shape kernel specialization at runtime (TensorRT-RTX only). Options: "lazy", "eager", "none". Default: "lazy".
Collaborator

Can we add a warning or check in case a user configures dynamic_shapes_kernel_specialization_strategy on standard TensorRT?

Contributor Author

This is a good suggestion, Lan. I have a follow-up task to emit user warnings for:

  1. timing cache used in TRT-RTX
  2. runtime cache used in standard TRT
  3. dynamic shape strategy used in standard TRT
  4. cudagraphs flag used in standard TRT

I will put the warnings in that follow-up so that it's easier to review the change/behavior. A rough sketch of what such a check could look like is shown below.
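
A minimal sketch of what such a guard could look like, assuming the ENABLED_FEATURES.tensorrt_rtx flag referenced later in this thread; the helper name and call site are hypothetical and may differ from the eventual follow-up:

```python
import warnings

from torch_tensorrt._features import ENABLED_FEATURES


def _warn_rtx_only_settings(settings) -> None:
    # Hypothetical helper: warn when an RTX-only knob is set on a standard
    # TensorRT build, where it would otherwise be silently ignored.
    if ENABLED_FEATURES.tensorrt_rtx:
        return
    if settings.dynamic_shapes_kernel_specialization_strategy != "lazy":
        warnings.warn(
            "dynamic_shapes_kernel_specialization_strategy is only honored by "
            "TensorRT-RTX builds and will be ignored by standard TensorRT."
        )
```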

lanluo-nvidia merged commit 8903707 into pytorch:main Apr 21, 2026
84 checks passed
tp5uiuc added a commit to tp5uiuc/TensorRT that referenced this pull request Apr 22, 2026
Address the structural PR feedback by extracting TensorRT-RTX-specific
IRuntimeConfig state into its own type and collapsing the per-feature
appliers that previously scattered `#ifdef TRT_MAJOR_RTX` through
TRTEngine.

What
 - New core/runtime/TRTRuntimeConfig.{h,cpp} owns the IRuntimeConfig
   shared_ptr plus (on TRT-RTX) the IRuntimeCache, runtime-cache path,
   dynamic shapes kernel strategy, CUDA graph strategy, and the
   rtx_native_cudagraphs_disabled one-shot flag. All per-feature
   appliers live there as public members and are no-ops on non-RTX
   builds, keeping the only `#ifdef TRT_MAJOR_RTX` scatter contained
   in this new file.
 - Strategy fields are now strongly-typed enums
   (`DynamicShapesKernelStrategy`, `CudaGraphStrategyOption`) with
   matching `to_string`/`to_int` helpers, validated at engine
   construction via `to_dynamic_shapes_kernel_strategy` /
   `to_cuda_graph_strategy_option` rather than raw int ranges.
 - `TRTEngine::recreate_execution_context` is now backend-agnostic:
   it calls `runtime_cfg.ensure_initialized`, applies the allocation
   strategy, and creates the execution context via
   `createExecutionContext(IRuntimeConfig*)`. Both standard TensorRT
   and TRT-RTX go through this uniform path; only the three RTX-only
   setters (`setRuntimeCache`, `setDynamicShapesKernelSpecializationStrategy`,
   `setCudaGraphStrategy`) stay behind an
   `#ifdef TRT_MAJOR_RTX` guard inside the struct.
 - `~TRTEngine` now wraps cleanup in try/catch and delegates cache
   persistence to `TRTRuntimeConfig::save_runtime_cache_nothrow`, so
   stack unwinding can no longer propagate a cache-save failure out
   of the destructor.
 - `save_runtime_cache_nothrow` uses `std::filesystem` + atomic
   `tmp+rename` only; file locking is out of scope for this PR and
   will be introduced in a follow-up once we pick a portable
   mechanism.
 - `is_monolithic_capturable` asserts `exec_ctx` is non-null; the
   three RTX-only appliers `TORCHTRT_ASSERT` that `config` is live
   before dereferencing.
 - `disable_rtx_native_cudagraphs` persists the runtime cache before
   flipping the strategy so any kernels compiled under the internal
   capture survive to the next reload.
 - `TRTEngine::to_str` now emits human-readable strategy names (via
   `to_string(enum)`) instead of integer codes.
 - New serialization indices (`RUNTIME_CACHE_PATH_IDX`,
   `DYNAMIC_SHAPES_KERNEL_STRATEGY_IDX`, `CUDA_GRAPH_STRATEGY_IDX`) are now
   `#ifdef TRT_MAJOR_RTX`-gated in runtime.h, register_jit_hooks.cpp,
   the FlattenedState tuple, the serialize/deserialize constructors,
   and `__obj_flatten__`. Standard TRT builds keep `SERIALIZATION_LEN == 11`
   so engines serialized there do not carry RTX-only slots.
 - Python `_TorchTensorRTModule` reads the RTX-only index accessors
   and writes the RTX-only engine-info slots only when
   `ENABLED_FEATURES.tensorrt_rtx` is true. Standard TRT users see
   no new behavior at runtime.
 - Deduplicated `_compiler.py` arguments after rebase on upstream
   main where PR pytorch#4184 had already added
   `dynamic_shapes_kernel_specialization_strategy`. Kept one copy of
   each arg; `cuda_graph_strategy` is threaded through all three
   compile() entry points.

Build + tests
 - RTX build on A100 / L40S: libtorchtrt.so and libtorchtrt_runtime.so link
   clean, no `#ifdef` diagnostics. Pre-commit checks
   pass (clang-format, black, isort, ruff, mypy, typos, buildifier).
 - All 35 runtime-cache/strategy tests pass; regression across
   test_000_runtime_cache.py (Python runtime), test_002_cudagraphs_cpp.py,
   test_005_dynamic_allocation.py is green.

Addresses review comments on PR pytorch#4202:
 - Guarding of new IDX entries and Python accessors on
   TRT_MAJOR_RTX / ENABLED_FEATURES.tensorrt_rtx.
 - Encapsulation of RTX-specific state in a dedicated type with
   enumerated strategies and transparent standard-TRT/RTX behavior.
 - Destructor exception safety.
 - Unification of the execution-context creation path via
   IRuntimeConfig.
 - Removal of file locking for runtime-cache persistence.
 - Debug asserts before dereferencing the live IRuntimeConfig.
 - Human-readable to_str output.
 - save_runtime_cache invoked from disable_rtx_native_cudagraphs.
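
The commit above persists the runtime cache with an atomic tmp+rename. That pattern is language-agnostic; a minimal Python sketch of it (hypothetical function, not the actual C++ save_runtime_cache implementation) looks like this:

```python
import os
import tempfile


def save_cache_atomically(path: str, payload: bytes) -> None:
    # Write to a temporary file in the same directory, then rename it over the
    # destination. os.replace is atomic on POSIX and Windows, so readers never
    # observe a partially written cache file.
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".cache.tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise
```
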
tp5uiuc added a commit to tp5uiuc/TensorRT that referenced this pull request Apr 22, 2026
Address PR review comments that asked the new C++ runtime tests be
folded into existing feature-level files rather than shipped as
parallel `*_cpp.py` files.

What
 - Merge `test_000_runtime_cache_cpp.py` into the existing
   `test_000_runtime_cache.py`. The file already covered the Python
   runtime path; two new classes (`TestRuntimeCacheCppPersistence`,
   `TestCppSerializationIndices`) cover the C++ runtime path via
   `use_python_runtime=False`, and the serialization-index
   assertions. Skip on non-RTX builds.
 - Fold the C++ runtime cases for dynamic shapes kernel
   specialization strategy into `test_001_dynamic_shapes_kernel_strategy.py`
   (introduced upstream in PR pytorch#4184). Two new classes
   (`TestDynamicShapesKernelStrategyCpp`,
   `TestDynamicShapesKernelStrategyCppInvalidValue`) exercise lazy/eager/none end-to-end and
   reject invalid strategy names. The pre-existing Python runtime
   tests remain untouched.
 - Rename `test_000_cuda_graph_strategy.py` to `test_001_cuda_graph_strategy.py`
   to match the `test_001_*` convention used for L1
   RTX-only features. When upstream lands the Python runtime
   counterpart (PR pytorch#4187), both sets fold into the same file.
 - Add model-level tests: `test_runtime_cache_models.py` gains a
   `TestRuntimeCacheCppModels` class exercising ResNet18 through the
   C++ runtime with warm-cache roundtrip.
   `test_dynamic_shapes_kernel_strategy_models.py` gains
   `TestDynamicShapesKernelStrategyCppModels` covering lazy/eager/none on
   ResNet18 via the C++ runtime.

Verified
 - 35 passed / 3 skipped in the runtime/ tests (merged file plus
   test_001 strategy files).
 - No regression in test_002_cudagraphs_cpp.py (8 passed) or
   test_005_dynamic_allocation.py (1 passed).

Addresses PR pytorch#4202 review comments asking for test file merges and
the addition of model-level runtime_cache_models.py /
dynamic_shapes_kernel_strategy_models.py coverage.
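
For orientation, a rough sketch of the shape such a merged strategy test can take (class name, model, and shapes are hypothetical, not the actual contents of test_001_dynamic_shapes_kernel_strategy.py):

```python
import unittest

import torch
import torch_tensorrt as torchtrt
from torch_tensorrt._features import ENABLED_FEATURES


@unittest.skipIf(not ENABLED_FEATURES.tensorrt_rtx, "TensorRT-RTX only feature")
class TestDynamicShapesKernelStrategySketch(unittest.TestCase):
    def _compile(self, strategy):
        model = torch.nn.Linear(64, 32).eval().cuda()
        inputs = [
            torchtrt.Input(min_shape=(1, 64), opt_shape=(8, 64), max_shape=(32, 64))
        ]
        return torchtrt.compile(
            model,
            ir="dynamo",
            inputs=inputs,
            use_python_runtime=False,  # exercise the C++ runtime path
            dynamic_shapes_kernel_specialization_strategy=strategy,
        )

    def test_strategies_end_to_end(self):
        # "lazy", "eager", and "none" should all produce a working module.
        for strategy in ("lazy", "eager", "none"):
            trt_mod = self._compile(strategy)
            out = trt_mod(torch.randn(4, 64, device="cuda"))
            self.assertEqual(tuple(out.shape), (4, 32))

    def test_invalid_strategy_rejected(self):
        with self.assertRaises(Exception):
            self._compile("not-a-strategy")
```
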
tp5uiuc added a commit to tp5uiuc/TensorRT that referenced this pull request Apr 22, 2026
Follow-up to 54f9ccd / 1fa8c82 addressing the second batch of PR
pytorch#4202 review feedback. Pure refactor with no user-visible behavior
change; all tests green on A100 (35 passed / 3 skipped + 9 regression
passed).

TRTEngine
 - Constructor signature simplified: three separate `runtime_cache_path`
   / `dynamic_shapes_kernel_strategy` / `cuda_graph_strategy` parameters
   collapsed into a single `TRTRuntimeConfig runtime_cfg` sink parameter.
   The forwarding ctor std::moves it into the primary ctor, which
   std::moves it into the member.
 - String sink parameters (mod_name, serialized_engine, serialized_metadata)
   taken by value and moved into members / slugify.
 - Deserialization constructor routes through the new free function
   make_runtime_config_from_serialized, which internalizes the
   TRT_MAJOR_RTX-gated index reads so the constructor itself stays
   unguarded.
 - FlattenedState uses a single TRTRTX_FLATTENED_STATE_EXTRAS macro for
   the three RTX-only tuple entries instead of duplicating the first
   eleven entries across two branches.
 - Destructor restored to the pre-refactor structure: torch::cuda::synchronize
   runs outside a try block and runtime_cfg.save_runtime_cache (now noexcept
   by signature) is called directly. Exception
   safety is guaranteed by the member's type, not by a defensive
   try/catch.
 - __obj_flatten__ and serialize cast enum values via
   std::underlying_type_t<...> instead of int so serialization stays
   in lockstep with any future underlying-type change on the enums.

TRTRuntimeConfig
 - Conversion helpers take std::underlying_type_t<Enum> (the declared
   32-bit integer type) instead of raw int. Callers at serialization
   boundaries explicitly std::stoi / static_cast into the right type.
 - [[nodiscard]] added to to_string, to_dynamic_shapes_kernel_strategy,
   to_cuda_graph_strategy_option, uses_internal_capture,
   is_monolithic_capturable, to_str, and make_runtime_config_from_serialized.
 - to_string default cases now TORCHTRT_CHECK(false, ...) with the
   unexpected integer value; std::unreachable is C++23.
 - set_execution_context_allocation_strategy is now const.
 - Cache I/O split into two layers:
     - Free functions load_runtime_cache(path, cache) and
       save_runtime_cache(path, cache) perform the raw std::filesystem
       I/O and use TORCHTRT_CHECK on failure -- exception-propagating,
       easier to test in isolation.
     - Member TRTRuntimeConfig::save_runtime_cache() is a noexcept
       wrapper that calls the free function and swallows exceptions via
       try/catch -- safe from a destructor.
   The _nothrow suffix is dropped from the member name (the signature
   now carries that contract).
 - write_to_str(ostream&) replaced by two functions: a const-correct
   to_str() -> std::string, and a free operator<<(ostream&, const
   TRTRuntimeConfig&) that wraps it with "Runtime cfg { ... }"
   delimiters. TRTEngine::to_str streams the config via the free
   operator.

Python
 - _settings.py: removed a duplicated
   dynamic_shapes_kernel_specialization_strategy field and its duplicated docstring left
   over from the upstream rebase of PR pytorch#4184 into our changes.

Covers review comments 3126538200, 3126541782, 3126547529, 3126549147,
3126682329, 3126683329, 3126693226, 3126715369, 3126725953, 3126736626,
3126738422, 3126745230, 3126747553, 3126749405, 3126764831, 3126772536,
3126786564, 3126803652, 3126816780, 3126818065, 3126818561, 3126819429,
3126823781, 3126840987, 3126846827.

Labels

backend: TensorRT-RTX, cla signed, component: api [Python], component: build system, component: conversion, component: core, component: dynamo, component: runtime, component: tests, documentation
